Abstract

Distributed  memory  multiprocessor  systems  can  provide  the  computing  power 
necessary  for  large-scale  scientific  applications.  A  critical  performance  issue  for  a 
number  of  these  applications  is  the  efficient  transfer  of  data  to  secondary  storage. 
Recently  several  research  groups  have  proposed  FORTRAN  language  extensions 
for exploiting the data parallelism of such scientific codes on distributed memory
architectures. However, few of these high performance FORTRANs provide
appropriate  constructs  for  controlling  the  use  of  the  parallel  I/O  capabilities  of 
modern  multiprocessing  machines.  In  this  paper,  we  propose  constructs  to  specify 
I/O  operations  for  distributed  data  structures  in  the  context  of  Vienna  Fortran. 
These  operations  can  be  used  by  the  programmer  to  provide  information  which  can 
help  the  compiler  and  runtime  environment  make  the  most  efficient  use  of  the  I/O 
subsystem. 


The work described in this paper is being carried out as part of the P2702 ESPRIT research project
"An Automatic Parallelization System for Genesis" funded by the Austrian Ministry for Science and
Research (BMWF). The research was also supported by the National Aeronautics and Space Administration
under NASA contracts NAS1-18605 and NAS1-19480 while some of the authors were in residence
at ICASE, Mail Stop 132C, NASA Langley Research Center, Hampton, VA 23681.
1 Introduction


Distributed memory multiprocessors (DMMPs), such as Intel's Paragon and Thinking
Machines' CM-5, provide an attractive approach to high speed computing, particularly
because their performance can be easily scaled up by increasing the number of processors.
The I/O bottleneck has been somewhat alleviated in these systems by powerful
Concurrent Input/Output Systems (CIOSs) [1, 2, 3, 9, 16].

Hardware and software architectures of CIOSs provided by various DMMP vendors
differ substantially. For example, the Concurrent File System™ (CFS) developed by
Intel for the iPSC/2 and iPSC/860 supercomputers [15] is based on an architecture
which is straightforward to use while delivering high speed concurrent access to large
data sets. The CFS is based on a technique called striping. The striping scheme allows
a single file to be spread across multiple disks (striped) [3] so as to improve access speed
and decrease congestion in communication links. Striping is done at the logical block
level. For example, the even numbered logical blocks of a file may be allocated to disk 0
while the odd numbered logical blocks are located on disk 1.
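
The even/odd example generalizes to a round-robin placement over any number of disks. The following Python sketch illustrates that mapping. The function names are hypothetical, and the modulo rule is an assumption made for illustration; it is not a description of the actual CFS allocation policy.

```python
def disk_for_block(logical_block: int, num_disks: int) -> int:
    """Return the disk holding a logical block under round-robin striping.

    Assumption: block b is placed on disk (b mod num_disks).
    """
    if num_disks <= 0:
        raise ValueError("num_disks must be positive")
    return logical_block % num_disks


def stripe_layout(num_blocks: int, num_disks: int) -> dict:
    """Map each disk number to the list of logical blocks stored on it."""
    layout = {disk: [] for disk in range(num_disks)}
    for block in range(num_blocks):
        layout[disk_for_block(block, num_disks)].append(block)
    return layout
```

With two disks, `stripe_layout(6, 2)` places the even numbered blocks on disk 0 and the odd numbered blocks on disk 1, as in the example in the text.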

Despite  significant  advances  in  hardware,  programming  DMMPs  has  been  found  to 
be  relatively  difficult.  Data  and  work  have  to  be  distributed  among  the  processors  and 
explicit  message  passing  has  to  be  used  to  access  remote  data. 

In recent years, several language extensions have been proposed to provide a high
level environment for porting data parallel scientific codes to DMMPs. The fundamental
goal of these approaches is to allow the user to specify the code using a global
index space while providing annotations for specifying the distribution of data. The
compiler then analyses such high level code and restructures the code into an SPMD
(Single Program Multiple Data) program for execution on the target distributed memory
multiprocessor. Work distribution is based on the owner-computes rule and non-local
references are satisfied by inserting appropriate message passing statements in the
generated code [7, 11, 19].

Most of these efforts are extensions of FORTRAN 77 [5, 6, 12, 13, 18, 20] or FORTRAN 90
[4, 13] and we collectively refer to them as high performance FORTRANs.

Recently, a coalition of groups from industry, government labs and academia formed the
High Performance FORTRAN Forum to design a standard set of extensions to FORTRAN 90
along the lines described above [8].

Among these languages, only Vienna Fortran [20] and MPP [11] provide support for
CIOSs. However, efficient use of the CIOSs is crucial for many applications, such as
processing of seismic data and simulations of oil fields, and largely dictates the performance
of the whole program.

Consider the situation where one program writes out an array and another program
then reads the data into a target array. If we use the standard FORTRAN write statement
to output the array to a sequential access file (most scientific codes use sequential access
files only), the data elements are written out in the column-major order as defined by
FORTRAN. When the source array is distributed across a set of processors, the processors
need to synchronize and generally execute serially in order to preserve this sequence when
writing out the elements of the array to secondary storage.

However, if the target array is distributed in the same manner as the source, both the
output and the input may become more efficient if we do not maintain this prescribed
sequential order. For example, each processor may, in parallel, write out the piece of
the array that it owns as a contiguous block. This data can then be subsequently read,
also in parallel, into a similarly distributed target array.

If  the  distributed  arrays  are  being  written  and  read  in  the  same  program,  for  example 
as  scratch  arrays,  then  the  compiler  knows  the  distribution  of  the  target  array.  In  such 
a  situation,  it  can  choose  the  best  possible  order  of  elements  on  external  storage  so 
as  to  make  both  the  input  and  output  efficient.  However,  in  general,  files  are  used 
to  communicate  data  between  programs.  In  such  situations,  the  compiler  and  runtime 
system do not have any information about the distribution of the target array and hence
will  have  to  use  the  standard  order  for  the  elements. 

In this paper, we propose constructs which enable the user to provide some information
about how the data to be stored in the files is going to be subsequently used. This
information  allows  the  compiler  to  parallelize  read  and  write  operations  for  distributed 
arrays  by  selecting  a  well  suited  data  sequence  in  the  files.  Note  that  the  language 
constructs  described  here  operate  on  whole  arrays  rather  than  sections  of  arrays. 

We  present  the  concurrent  I/O  operations  in  the  context  of  Vienna  Fortran,  a  high 
performance  FORTRAN  language.  Section  2  introduces  some  Vienna  Fortran  language 
constructs  for  specifying  data  distribution  along  with  a  formal  model  to  describe  these 
distributions. The next section presents the concurrent I/O operations being proposed
here, while Section 4 provides some performance numbers to justify the need for these
operations. 

2  The  Vienna  Fortran  Distribution  Model 

In this section we present the basic language features of Vienna Fortran. A full description
of the entire language can be found in [20]. Vienna Fortran is based on a machine model
$\mathcal{M}$, consisting of a set $P$ of processors with local memory, interconnected by some network,
and a high performance file system. A Vienna Fortran program $Q$ is executed on $\mathcal{M}$ by
running the code produced by the Vienna Fortran Compiler on each processor of $\mathcal{M}$.
This  is  called  an  SPMD  (Single-Program-Multiple-Data)  model  [10]. 

Code  generation  is  guided  by  a  mapping  of  the  internal  data  space  of  Q  to  the 
processors. The internal data space $\mathcal{A}$ is the set of declared arrays of $Q$ (scalar variables
can be interpreted as one-dimensional arrays with one element).

Definition 1 Let $A \in \mathcal{A}$ denote an arbitrary array. The index domain of $A$ is denoted
by $I^A$. The shape of an index domain $I$, $shape(I)$, provides the extents in each dimension
of the domain.

2.1  Processors 

In  Vienna  Fortran  processors  are  explicitly  introduced  by  the  declaration  of  a  processor 
array. The notational conventions introduced for arrays above can also be used for
processor arrays, i.e., $I^R$ denotes the index domain of a processor array $R$.

2.2  Distributions 

A  distribution  of  an  array  maps  each  array  element  to  one  or  more  processors  which 
become  the  owners  of  the  element,  and,  in  this  capacity,  store  the  element  in  their  local 
memory.  We  model  distributions  by  functions  between  the  associated  index  domains. 

Definition  2  Distributions 

Let $A \in \mathcal{A}$, and assume that $R$ is a processor array. A distribution of the array $A$
with respect to $R$ is defined by the mapping:

$\delta_R^A : I^A \to \mathcal{P}(I^R) - \{\emptyset\}$

where $\mathcal{P}(I^R)$ denotes the power set of $I^R$.

In  Vienna  Fortran  the  distribution  of  an  array  is  specified  by  annotating  the  array 
declaration with a distribution expression. For example,

REAL A1(...), ..., Ar(...) DIST dex TO procs

declares the arrays $A_i$, $1 \le i \le r$. The distribution expression dex defines the distribution of
the arrays in the context of the given processor set procs. The distribution expression
in its simplest form consists of a sequence of distributions, one for each dimension of the
data array. A set of intrinsic distribution functions is provided, including the commonly
occurring block distribution, which maps contiguous elements to a processor, and the cyclic
distribution, which maps elements cyclically to the processor set.

Examples  of  arrays  with  some  basic  distributions  are  given  below: 

PROCESSORS R2D(16,16)

REAL A(256,4096) DIST (BLOCK, BLOCK) TO R2D
REAL B(256,4096,16) DIST (CYCLIC, BLOCK, :) TO R2D

The first statement declares a 16 x 16 two-dimensional processor array, R2D. The
array A is declared to be distributed such that its first dimension is partitioned into
blocks of size 16, while the second dimension is partitioned into blocks of size 256. These
blocks are mapped to the corresponding processors of R2D; for example, the segment
A(49:64, 3841:4096) is assigned to the processor R2D(4,16).

The ":" in the distribution expression of B specifies that the corresponding dimension is
not distributed; only the first two dimensions are distributed across the two dimensions
of the processor array.
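
The ownership rules behind the BLOCK and CYCLIC distributions can be made concrete with a short Python sketch. The function names are hypothetical, indices are 1-based as in FORTRAN, and the block size is assumed to be the extent divided by the number of processors, rounded up; the treatment of extents that do not divide evenly is an assumption of this sketch rather than something stated in the text.

```python
import math


def block_owner(i: int, extent: int, nprocs: int) -> int:
    """1-based processor index owning element i of a BLOCK-distributed dimension.

    Assumption: block size is ceil(extent / nprocs).
    """
    block_size = math.ceil(extent / nprocs)
    return (i - 1) // block_size + 1


def cyclic_owner(i: int, nprocs: int) -> int:
    """1-based processor index owning element i of a CYCLIC-distributed dimension."""
    return (i - 1) % nprocs + 1


def owner_block_block(i: int, j: int, shape: tuple, procs: tuple) -> tuple:
    """Owner of A(i, j) when both dimensions are BLOCK distributed."""
    return (block_owner(i, shape[0], procs[0]),
            block_owner(j, shape[1], procs[1]))
```

For a 256 x 4096 array on a 16 x 16 processor array, `owner_block_block(49, 3841, (256, 4096), (16, 16))` and `owner_block_block(64, 4096, (256, 4096), (16, 16))` both evaluate to `(4, 16)`, so the whole segment A(49:64, 3841:4096) lies on a single processor.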

Vienna  Fortran  supports  a  wide  range  of  facilities  for  distributing  and  aligning  data 
arrays.  Full  details  of  these  and  other  features  of  the  language,  including  examples,  can 
be found in [5, 20].

3 Concurrent Input/Output Operations

In this section, we describe the concurrent file operations provided in Vienna Fortran.
The  language  distinguishes  between  two  types  of  files:  standard  files  and  array  files. 
Standard  files  are  accessed  via  standard  FORTRAN  I/O  statements  whereas  array  files 
can  be  accessed  via  concurrent  I/O  operations  only. 

The  first  subsection  describes  the  structure  of  the  array  files.  The  concurrent  file 
operations  are  informally  introduced  in  the  next  subsection  while  the  last  subsection 
specifies these operations formally.

3.1  Array  Files 

Input/Output statements control the data flow between program variables and the file
system. The file system of machine $\mathcal{M}$ may reside physically on a host system and/or a
CIOS.

Definition 3 The file system $\mathcal{F}$ of machine $\mathcal{M}$ is defined by the union of a set of standard
FORTRAN files $\mathcal{F}_{ST}$ and a set of array files $\mathcal{F}_{ARR}$.


When  transferring  elements  of  a  distributed  array  to  an  array  file,  each  processor  does 
input/output  operations  controlling  the  transfer  of  the  local  part  of  the  array  to  or  from 
the  corresponding  part  of  the  file.  A  suitable  file  structuring  is  necessary  to  achieve  high 
transfer  efficiency. 

Array files in Vienna Fortran may contain values from more than one array. Therefore,
array files are structured into records. Each record contains an array distribution
descriptor followed by a sequence of data elements associated with this array.

Definition 4 An array file $F \in \mathcal{F}_{ARR}$ is a sequence of distributed array records
$\langle darec_1, darec_2, \ldots \rangle$. Each record can be associated with a distributed array, $A$, and
has the form $(\psi^A, \Theta^A)$, where

• $\psi^A$ is a distribution descriptor and has the structure $(I^A, I^R, \delta_R^A)$. Here, $\delta_R^A$ is the
distribution used for writing out the sequence of data elements and $I^A$ and $I^R$ are
the underlying array and processor index domains, respectively, used for defining
this distribution.

• $\Theta^A$ is a sequence of data elements stored in this record.
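
The record structure of Definition 4 can be pictured as a pair of plain data types. The Python sketch below is only an illustration: the class and field names are hypothetical, index domains are represented by their shapes, and a distribution is represented by a tuple of per-dimension keywords, which is a simplification of the mapping used in the formal model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DistributionDescriptor:
    """Simplified stand-in for the descriptor (array domain, processor domain, distribution)."""
    array_shape: Tuple[int, ...]   # extents of the array index domain
    proc_shape: Tuple[int, ...]    # extents of the processor index domain
    dist: Tuple[str, ...]          # per-dimension distribution, e.g. ("BLOCK", "BLOCK")


@dataclass
class ArrayRecord:
    """One distributed array record: a descriptor followed by the data elements."""
    descriptor: DistributionDescriptor
    data: List[float]


# An array file is modelled as an ordered sequence of records.
ArrayFile = List[ArrayRecord]
```

Because the descriptor travels with the data, a reader can compare it with the distribution of its own target array before deciding how to transfer the elements.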

3.2  I/O  Operations 

Concurrent I/O operations supported by Vienna Fortran can be classified into three
groups: data transfer, inquiry and file manipulation operations. These operations deal
with whole arrays which are distributed across a set of processors. Thus, a global
synchronization of the processors is required before they cooperate to execute the operation.

In this subsection, we informally describe the concurrent I/O operations supported
by Vienna Fortran.

Writing  to  a  File 

The concurrent write statement, CWRITE, can be used to write multiple arrays to a file
in a single statement. For each array, a distributed array record is written onto the file.
Vienna Fortran provides three forms of the concurrent write statement. These affect the
order of data elements written out to the distributed array record.

(i) In the simplest form, the individual distributions of the arrays determine the
sequence of array elements written out to the file. For example, consider the following statement:

CWRITE (f) A1, A2, ..., Ar

where f denotes the I/O unit number and $A_i$, $1 \le i \le r$, are array identifiers.
This form should be used when the data is going to
be read into arrays with the "same" distribution as $A_i$. In this situation, the sequence
of elements in the file is generated by concatenating the linearized local segments of
$A_i$ owned by the individual processors according to the increasing order of the linearized
index  of  the  processors.  This  is  the  most  efficient  form  of  writing  out  a  distributed  array 
since  each  processor  can  independently  (and  in  parallel)  write  out  the  piece  of  the  array 
that  it  owns,  thus  utilizing  the  I/O  capacity  of  the  architecture  to  its  fullest. 

(ii) Consider the situation in which the data is to be read several times into an array
B, where the distribution of B is different from that of the array being written out.
In  this  case,  the  user  may  wish  to  optimize  the  sequence  of  data  elements  in  the  file 
according  to  the  distribution  of  the  array  B  so  as  to  make  the  multiple  read  operations 
more  efficient.  Additional  parameters  of  the  CWRITE  statement  enable  the  user  to 
specify  (a)  the  shape  of  the  distributed  array  to  which  the  read  operation  will  be  applied, 
and  (b)  its  distribution.  These  additional  specifications  can  then  be  used  by  the  compiler 
to  determine  the  sequence  of  elements  in  the  output  file. 

If a shape is specified, the size of each of the arrays $A_1, \ldots, A_r$ has to be equal to the product
of the extents of the specified index domain. The resulting rank and shape have to match
the distribution specification. For example, the following statement can be used if A is
a two dimensional array:

CWRITE (f, PROCESSORS='R2D(N,N)',
&       DIST='(BLOCK, CYCLIC) TO R2D') A

Here, the elements of the array A are written so as to optimize reading them into
an array which is distributed as (BLOCK, CYCLIC). Depending on the sequence to be
written, the processors (a) could synchronize so as to execute the correct sequence of the
individual writes to secondary storage, or (b) could incur the overhead of redistributing
the data internally before using a parallel write operation to output the data.

(iii) If the data in a file is to be subsequently read into arrays with different distributions
or there is no information available about the distribution of the target arrays, the
user may allow the compiler to choose the sequence of the elements to be written out.
This is done by specifying 'SYSTEM' as the distribution in the CWRITE statement:

CWRITE (f, DIST='SYSTEM') A1, ..., Ar

This allows the compiler and the runtime system to cooperate to determine the best
possible sequence for writing out the data, given that there is no knowledge about the
distribution of the target arrays.
Reading from a File

A read operation to one or more distributed arrays is specified by a statement of the
following form:

CREAD (f) B1, B2, ..., Br

where again f denotes the I/O unit number and $B_i$, $1 \le i \le r$, are array identifiers. The
operation reads the next r distributed array records in f. The data elements of the i-th
record are read into $B_i$. Note that the semantics of standard FORTRAN I/O operations
have to be maintained. That is, if an array A is written out to a file and then read into
another array B, the column-major linearization of FORTRAN arrays will determine
which element of A is read into a given element of B. The actual transfer of data, thus,
is done by taking into account the distribution descriptor of the i-th record and the shape
and the distribution of $B_i$.
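
The requirement that column-major correspondence be preserved, whatever order the elements have in the file, can be illustrated with a small sequential Python sketch. All names are hypothetical. The order in which the writer stored the elements is passed in explicitly as a list of 1-based indices, standing in for what a real implementation would derive from the distribution descriptor; no parallelism is modelled.

```python
import math
from itertools import product


def column_major_position(index, shape):
    """0-based position of a 1-based multi-index in FORTRAN column-major order."""
    pos, stride = 0, 1
    for k, extent in zip(index, shape):
        pos += (k - 1) * stride
        stride *= extent
    return pos


def column_major_indices(shape):
    """Yield all 1-based indices of an array, first dimension varying fastest."""
    for t in product(*(range(1, e + 1) for e in reversed(shape))):
        yield tuple(reversed(t))


def read_record(data, file_index_order, src_shape, dst_shape):
    """Map record data onto the target array B, preserving column-major semantics."""
    if math.prod(dst_shape) != len(data):
        raise ValueError("target array size does not match record size")
    linear = [None] * len(data)
    for value, idx in zip(data, file_index_order):
        linear[column_major_position(idx, src_shape)] = value
    return dict(zip(column_major_indices(dst_shape), linear))
```

For example, a 2 x 2 array stored in the file in row order and read into a one-dimensional array of four elements still delivers A(2,1) to B(2), exactly as a standard FORTRAN write followed by a read would.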

Accessing  a  Distribution  Descriptor 

The  distribution  descriptor  of  the  current  distributed  array  record  in  the  file  can  be 
accessed  as  follows: 


CDISTR (f)

The value returned by this function can be used in the special Vienna Fortran case
statement, DCASE, which allows decisions to be made based on the value of the distribution
descriptors.

Other  Operations 

• COPEN (colist) - Open an array file.

• CCLOSE (cclist) - Close an array file.

• CSKIP (f, *) - Skip to the end of file.

• CSKIP (f, n) - Skip n distributed array records.

• CBACKARRAY (f) - Move back to the previous array record.

• CREWIND (f) - Rewind the file.

• CEOF (f) - Check for end of file.

The operations COPEN and CCLOSE have the same meaning and the lists colist and cclist have
the same form as their counterparts in FORTRAN 77.


Note that the concurrent I/O operations supported by Vienna Fortran can be applied
only to the special array files defined here, and conversely array files can only be accessed
through these operations.

3.3  Operations  on  Array  Files 

In  this  subsection,  we  formally  describe  the  semantics  of  the  concurrent  operations  on 
array  files. 

Definition 5

1. A file $F$ can be viewed as a concatenation of two sequences:

   $F = \overleftarrow{F} \; \mathit{cat} \; \overrightarrow{F}$

   where $\overleftarrow{F}$ is the part of the file which has already been processed and $\overrightarrow{F}$ is the rest of the
   file. These components are not directly accessible to the programmer. $\mathit{cat}$ is the
   operation of sequence concatenation.

2. The null sequence is denoted by the symbol $\langle\rangle$.

3. If $n\text{-}tuple = (item_1, item_2, \ldots, item_n)$,
   then $n\text{-}tuple \downarrow i$ is the i-th item of $n\text{-}tuple$.

4. If $F = \langle rec_1, rec_2, \ldots, rec_n, \ldots, rec_m \rangle$ for $0 < n \le m$, then

   $first(F) = rec_1$
   $rest(F) = \langle rec_2, \ldots, rec_m \rangle$
   $last(F) = rec_m$
   $withoutlast(F) = \langle rec_1, \ldots, rec_{m-1} \rangle$
   $firstn(F, n) = \langle rec_1, \ldots, rec_n \rangle$
   $withoutfirstn(F, n) = \langle rec_{n+1}, \ldots, rec_m \rangle$

5. $B :\Leftarrow \Theta^A$ means the transfer of the file data elements denoted by $\Theta^A$ to the
   distributed array $B$.

6. $B \;\mathit{reorder}\; :\Leftarrow (\Theta^A, \psi^A, \psi^B)$ has the following interpretation: $\Theta^A$ is read from
   the file, then it is reordered into an intermediate sequence which matches $\psi^B$, and
   finally, this sequence is transferred to the distributed array $B$ of the program.

7. Let $\psi^A = (I^A, I^{R_1}, \delta_{R_1}^A)$ and $\psi^B = (I^B, I^{R_2}, \delta_{R_2}^B)$. An equivalence relation, $\equiv$, is
   defined among distribution descriptors. We say $\psi^A \equiv \psi^B$ iff $shape(I^A) = shape(I^B)$,
   $shape(I^{R_1}) = shape(I^{R_2})$ and the two distributions $\delta_{R_1}^A$ and $\delta_{R_2}^B$ are equivalent.

8. $i/o\text{-}operation(F) \; A_1, A_2, \ldots, A_m$ is equivalent to:

   $i/o\text{-}operation(F) \; A_1$
   $i/o\text{-}operation(F) \; A_2$
   $\ldots$
   $i/o\text{-}operation(F) \; A_m$

The formal specification of the array file operations presented in this subsection is partly based on
the file model proposed by Tennent [17].

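Definition 6 Array file operations are defined as follows. Auxiliary operations, such as
opening and closing files, are not included in the formal definition here.

1. Data transfer operations

• CWRITE (F) A is equivalent to:

   $\overleftarrow{F} := \overleftarrow{F} \; \mathit{cat} \; \langle (\psi^A, \Theta^A) \rangle$
   $\overrightarrow{F} := \langle\rangle$

• CWRITE (F, SHAPE='(E1, ..., En)', PROCESSORS='R(Y1, ..., Yk)', DIST='dex') A
  is equivalent to:

   $\overleftarrow{F} := \overleftarrow{F} \; \mathit{cat} \; \langle (\psi^{new}, \Theta^A) \rangle$
   $\overrightarrow{F} := \langle\rangle$

  where

   $\psi^{new} = (I^{NEW}, I^R, \delta_{dex})$
   $I^{NEW} = [1 : E_1] \times \cdots \times [1 : E_n]$

• CWRITE (F, DIST='SYSTEM') A
  is equivalent to:

   $\overleftarrow{F} := \overleftarrow{F} \; \mathit{cat} \; \langle (\psi', \Theta^A) \rangle$
   $\overrightarrow{F} := \langle\rangle$

  where $\psi'$ is implementation defined.

• CREAD (F) B is equivalent to:

   if ($\psi^A \equiv \psi^B$) then
      $B :\Leftarrow first(\overrightarrow{F}) \downarrow 2$
   else
      $B \;\mathit{reorder}\; :\Leftarrow (first(\overrightarrow{F}) \downarrow 2, \psi^A, \psi^B)$
   endif
   $\overleftarrow{F} := \overleftarrow{F} \; \mathit{cat} \; first(\overrightarrow{F})$
   $\overrightarrow{F} := rest(\overrightarrow{F})$

  where $\psi^A = first(\overrightarrow{F}) \downarrow 1$

2. Inquiry operations

• CDISTR (F) is equivalent to:

   $first(\overrightarrow{F}) \downarrow 1$

• CEOF (F) is equivalent to:

   $\overrightarrow{F} = \langle\rangle$

3. File manipulation operations

• CREWIND (F) is equivalent to:

   $\overrightarrow{F} := \overleftarrow{F} \; \mathit{cat} \; \overrightarrow{F}$
   $\overleftarrow{F} := \langle\rangle$

• CBACKARRAY (F) is equivalent to:

   $\overrightarrow{F} := last(\overleftarrow{F}) \; \mathit{cat} \; \overrightarrow{F}$
   $\overleftarrow{F} := withoutlast(\overleftarrow{F})$

• CSKIP (F, *) is equivalent to:

   $\overleftarrow{F} := \overleftarrow{F} \; \mathit{cat} \; \overrightarrow{F}$
   $\overrightarrow{F} := \langle\rangle$

• CSKIP (F, n) is equivalent to:

   $\overleftarrow{F} := \overleftarrow{F} \; \mathit{cat} \; firstn(\overrightarrow{F}, n)$
   $\overrightarrow{F} := withoutfirstn(\overrightarrow{F}, n)$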
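
The two-sequence file model lends itself to a direct executable reading. The Python sketch below is a sequential illustration under stated assumptions: the class and method names are hypothetical, `left` holds the already processed part of the file and `right` the rest, records are plain (descriptor, data) pairs, and the reordering step of the read operation is left out.

```python
class ArrayFileModel:
    """Sequential model of an array file as (processed part, remaining part)."""

    def __init__(self, records=()):
        self.left = []               # records already processed
        self.right = list(records)   # records still to be processed

    def cwrite(self, descriptor, data):
        # Append one record after the processed part; the rest is discarded.
        self.left.append((descriptor, data))
        self.right = []

    def cread(self):
        # Return the next record and move it to the processed part.
        record = self.right[0]       # raises IndexError at end of file
        self.left.append(record)
        self.right = self.right[1:]
        return record

    def cdistr(self):
        # Descriptor (first component) of the current record.
        return self.right[0][0]

    def ceof(self):
        return not self.right

    def crewind(self):
        self.right = self.left + self.right
        self.left = []

    def cbackarray(self):
        # Move the last processed record back to the front of the rest.
        self.right = [self.left[-1]] + self.right
        self.left = self.left[:-1]

    def cskip(self, n=None):
        # n=None plays the role of '*': skip to the end of the file.
        if n is None:
            n = len(self.right)
        self.left = self.left + self.right[:n]
        self.right = self.right[n:]
```

A short session shows the intended behaviour: after two writes the file is at its end, a rewind makes the first descriptor current again, and skipping one record positions the file at the second record.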
4 Performance


In  this  section,  we  present  some  performance  measurements  to  justify  the  need  for  user 
control  over  the  manner  in  which  data  from  distributed  arrays  is  transferred  to  and  from 
secondary  storage. 

Consider  the  following  declarations: 

PARAMETER  (NP  =  ...) 

PARAMETER  (N  =...) 

PROCESSORS  P(NP,  NP) 

REAL  A(N,  N)  DIST  (  BLOCK,  BLOCK) 

Here, A is an N x N array, block distributed in both dimensions across an NP x NP
processor array. Figure 1 shows the distribution of elements of the array A for the case
of N = 4 and NP = 2.


           A(1,1)  A(1,2)  |  A(1,3)  A(1,4)
  P(1,1)   A(2,1)  A(2,2)  |  A(2,3)  A(2,4)   P(1,2)
           ----------------+----------------
  P(2,1)   A(3,1)  A(3,2)  |  A(3,3)  A(3,4)   P(2,2)
           A(4,1)  A(4,2)  |  A(4,3)  A(4,4)

Figure 1: A two-dimensional block distributed array

If such an array is written out using a standard FORTRAN write statement, the semantics
enforce the column-major linearization of the data elements. This would require
close synchronization of the processors owning A to execute the write statement. Besides
this serialization, another drawback is that each processor writes only small blocks of
the individual columns. On most systems, such as the iPSC/860, the best performance
for I/O operations is reached for large blocks. The same inefficiencies recur if the data
has to be subsequently read into a similarly block distributed two-dimensional array.

On the other hand, if we use the simplest form of the CWRITE statement as proposed
in the last section, the sequence of the data elements in the file would be as follows:


A(1,1), A(2,1), A(1,2), A(2,2), A(3,1), A(4,1), A(3,2), A(4,2), A(1,3), A(2,3), A(1,4),
A(2,4), A(3,3), A(4,3), A(3,4), A(4,4)

Each  process  can  thus  write  its  local  elements  as  one  block,  in  parallel  with  the 
other  processes.  Similarly,  reading  the  data  into  a  similarly  distributed  array  can  also  be 
executed  in  parallel. 
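
The element sequence given above follows mechanically from the rule stated earlier: linearize each local segment in column-major order and concatenate the segments in increasing linearized processor index. The Python sketch below reproduces it. The function names are hypothetical, and the sketch assumes a (BLOCK, BLOCK) distribution whose extents divide evenly by the processor counts, with the processor array itself also linearized in column-major order.

```python
def local_indices(shape, procs, p):
    """1-based global indices owned by processor p, in local column-major order.

    Assumes a (BLOCK, BLOCK) distribution with evenly divisible extents.
    """
    (n1, n2), (np1, np2) = shape, procs
    b1, b2 = n1 // np1, n2 // np2
    rows = range((p[0] - 1) * b1 + 1, p[0] * b1 + 1)
    cols = range((p[1] - 1) * b2 + 1, p[1] * b2 + 1)
    return [(i, j) for j in cols for i in rows]


def file_order(shape, procs):
    """Sequence of array indices in the file for the simplest form of CWRITE."""
    order = []
    for q2 in range(1, procs[1] + 1):       # processor columns vary slowest
        for q1 in range(1, procs[0] + 1):   # processor rows vary fastest
            order.extend(local_indices(shape, procs, (q1, q2)))
    return order
```

Calling `file_order((4, 4), (2, 2))` yields the sixteen indices in exactly the order listed in the text, with each processor contributing one contiguous run of four elements.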

In order to determine the overheads involved in writing out an array distributed
as described above, we implemented five versions of the write statement on the Intel
iPSC/860. The system consists of 32 processing nodes and 4 I/O nodes using CFS to
manage the file system.

The first four versions of our experiment preserve the standard FORTRAN linearization
order, while the last uses the sequence suggested above.

In the first implementation, CENT, each process sends its local block of elements to
a designated process which collects the entire array. This central process then writes the
array out to the CFS using a standard FORTRAN write statement.

The next three implementations, SEQ0, SEQ1 and SEQ2, again preserve the column-major
linearization of the array and use CFS's file modes 0, 1 and 2 respectively [9] to
write out the array. In SEQ0, each process manages its own file pointer. All processes
write unsynchronized to the same file. They position their file pointer to the appropriate
position in the file for each subcolumn that they have to output.
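
To make the cost of this scheme visible, the Python sketch below computes where each process would have to position its file pointer, one seek per subcolumn. It is an illustration under assumptions not taken from the text: the function name is hypothetical, the array is N x N and block distributed over an NP x NP process grid with N divisible by NP, elements occupy 4 bytes, and the file holds the raw column-major data with no record markers.

```python
def subcolumn_writes(n, nprocs, p, elem_size=4):
    """Return (byte offset, byte length) for every subcolumn written by process p.

    p is the 1-based (row, column) position of the process in the grid.
    Assumes n is divisible by nprocs and a plain column-major file layout.
    """
    block = n // nprocs
    first_row = (p[0] - 1) * block + 1
    first_col = (p[1] - 1) * block + 1
    writes = []
    for j in range(first_col, first_col + block):
        # Column-major position of A(first_row, j), converted to bytes.
        offset = ((j - 1) * n + (first_row - 1)) * elem_size
        writes.append((offset, block * elem_size))
    return writes
```

For the small example with N = 4 and NP = 2, process P(2,1) issues two writes of 8 bytes each, at offsets 8 and 24. With hypothetical values of N = 1000 and NP = 4, every process would issue 250 separate writes of 1000 bytes, which is the kind of small-block access pattern the measurements are meant to expose.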

The processes work with a common file pointer in version SEQ1 and thus have to be
closely synchronized. For each part of a column, the appropriate process performs the
write while the other processes are waiting.

In SEQ2, the write operations are executed as collective operations. The columns
are written sequentially. Thus, each process which owns a part of the column writes its
part. Other processes perform the write with zero length information. The information
written in such a collective operation is ordered in the output file according to the process
numbers.

The last version, NEW, uses the implementation suggested in this paper. That is,
instead of writing out the data in the column-major order, each process writes out its
local piece as a contiguous block. The processes perform a single collective write using the
CFS's file mode 3.

Table 1 shows the times measured for a 1000 x 1000 array distributed blockwise
across a 4 x 4 processor array. Since the performance depends heavily on whether the
file to be written exists prior to the operation or not, we present timings for both cases.
The problem is that if the file does not already exist, new disk blocks have to be allocated
every time the file is extended. This is particularly an issue with the versions SEQ0,
SEQ1 and SEQ2, since each individual write for a part of the column extends the file.

Version    Including file creation    Pre-existing file
CENT       IT                         1.0
SEQ0       52.2                       0.5
SEQ1       43.9                       7.0
SEQ2       42.3                       4.4
NEW        1.9                        1.0

Table 1: Time (in secs) for writing out a distributed array

It is clear from our experiments that, at least on the iPSC/860, the version
NEW performs better than the rest of the implementations. This indicates that I/O
bound applications running on distributed memory machines may achieve much better
performance if the user can provide information which would help the compiler and
runtime system to choose the best possible sequence of the data elements written out to
secondary storage.

The concurrent I/O operations described in the last section are currently being integrated
into the Vienna Fortran Compilation System. We will report on the performance
of these operations, in the context of actual applications, at a later date.

5  Conclusions 

Vendors of massively parallel systems usually provide high-capacity parallel I/O subsystems.
Efficient usage of such subsystems is critical to the performance of I/O bound
application codes. In this paper, we have presented language constructs to express parallel
I/O operations on distributed data structures. These operations can be used by
the programmer to provide information which will allow the compiler and runtime environment
to optimize the transfer of data to and from secondary storage. The language
constructs presented here have been proposed in the context of Vienna Fortran; however,
they can be easily integrated into any other high performance FORTRAN extension.